ISM@FIRE-2015: Mixed Script Information Retrieval

Authors

  • Dinesh Kumar Prabhakar
  • Sukomal Pal
Abstract

This paper describes the approach we used in FIRE-2015 to identify the language of terms written in Roman script, and our approaches to retrieval in the mixed-script domain. The first approach identifies the class of each given term or word: its native language, and whether it is a named entity or of some other type. MaxEnt, a supervised classifier, was used for the classification; it performed best on the strict F-measure for NE, with a score of 0.46, and on the strict F-measure for NE_P, with a score of 0.24. For the MSIR subtask, a Divergence from Randomness (DFR) based approach was used, and it performed better with block indexing and query formulation. The overall NDCG@10 scores of our submissions are 0.4335, 0.5328, 0.4489 and 0.5369 for ISMD1, ISMD2, ISMD3 and ISMD4 respectively.
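
For readers unfamiliar with the retrieval metric reported above, NDCG@10 can be sketched roughly as follows. This is a minimal illustration that assumes the common formulation with linear gain and a log2 rank discount; the abstract does not say which variant the FIRE-2015 evaluation used. The function names and the relevance grades in the usage note are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances.

    Assumes linear gain and a log2(rank + 1) discount with 1-based ranks.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances=None, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance grades of the returned documents, in rank order.
    judged_relevances: grades of all judged documents for the query; if omitted,
    the ideal ranking is built from the returned documents only (an assumption
    made here for brevity).
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal
```

For instance, `ndcg_at_k([1, 3])` scores a run that places a less relevant document above a more relevant one, so the result falls below 1.0, while a perfectly ordered list such as `[3, 2, 1]` scores exactly 1.0.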


Similar resources

Mixed Script Ad hoc Retrieval using back transliteration and phrase matching through bigram indexing: Shared Task report by BIT, Mesra

This paper describes an approach for Mixed-script Ad hoc retrieval, a subtask of the FIRE 2015 Shared Task on Mixed Script Information Retrieval. We participated in subtask 2 of the shared task, where a statistical model was used to carry out back transliteration to Devanagari script. To perform the search, a bigram-based index of the documents was used and the search was performed using pivot t...

DA-IICT in FIRE 2015 Shared Task on Mixed Script Information Retrieval

This paper describes the methodology followed by Team Watchdogs in their submission for the shared task on Mixed Script Information Retrieval (MSIR) in FIRE 2015. I participated in subtask 1 (Query Word Labelling) and subtask 2 (Mixed-script Ad hoc retrieval). For subtask 1, a machine learning approach using a CRF classifier was used to classify the tokens as one of the possible languages using ...

AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text

The growth of social media content, such as Twitter and Facebook messages and blog posts, has created many new opportunities for language technology. User-generated content such as tweets and blogs in most languages is written in Roman script due to distinct social culture and technology. Some of it uses the language's own script or a mixed script. The primary challenges ...

Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification

In social media communication, code switching has become quite a common phenomenon, especially for multilingual speakers. Automatic language identification becomes both a necessary and challenging task in such an environment. In this work, we describe a CRF-based system with a voting approach for word-level labeling of code-mixed query words, as part of our participation in the shared task on Mixed ...

Overview of FIRE-2015 Shared Task on Mixed Script Information Retrieval

The Transliterated Search track has been organized for the third year in FIRE-2015. The track had three subtasks. Subtask I was on language labeling of words in code-mixed text fragments; it was conducted for 8 Indian languages: Bangla, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, mixed with English. Subtask II was on ad-hoc retrieval of Hindi film lyrics, movie reviews and astr...


Publication date: 2015